Previous Up Next

Cool Assembly Language

Cool Assembly Language is a simplified RISC-style assembly language that is reminiscent of MIPS Assembly Language crossed with x86 Assembly Language. It also features typing aspects that may remind one of Java Bytecode.

A Cool Assembly Language \(\bf program\) is a list of \(\bf instructions\). Each instruction may be preceded by any number of \(\bf labels\). Comments follow the standard Cool conventions. In addition, a semicolon \(\rm ;\) functions like a double dash \(\rm --\) in that it marks the rest of that line as a comment. The Cool CPU is a load-store architecture with eight general purpose registers and three special-purpose registers. For simplicity, a machine word can hold either a 32-bit integer value or an entire raw string; regardless, all machine words have size one.

This document assumes that you already have some familiarity with assembly language, registers, and how CPUs operate. We first present a formal grammar and then explain the semantics. Only terms in \(\rm typewriter\) font are part of the formal grammar. Text after \(\text{--}\) is a comment. We use \(\it italics\) for non-terminals.

\[ \begin{eqnarray} {\it register}\ &::=&\ {\rm r0} \text{ -- general purpose register #0, often used as the } {\it accumulator} \\ {\it register}\ &::=&\ {\rm r1} \text{ -- general purpose register #1} \\ {\it register}\ &::=&\ {\rm r2} \\ {\it register}\ &::=&\ {\rm r3} \\ {\it register}\ &::=&\ {\rm r4} \\ {\it register}\ &::=&\ {\rm r5} \\ {\it register}\ &::=&\ {\rm r6} \\ {\it register}\ &::=&\ {\rm r7} \\ {\it register}\ &::=&\ {\rm sp} \text{ -- stack pointer register} \\ {\it register}\ &::=&\ {\rm fp} \text{ -- frame pointer register} \\ {\it register}\ &::=&\ {\rm ra} \text{ -- return address pointer} \\ \end{eqnarray} \] \[ \begin{eqnarray} {\it instruction}\ &::=&\ {\rm li}\ {\it register} \text{ -- load immediate} \\ {\it instruction}\ &::=&\ {\rm mov}\ {\it register}\ < \!\!\!-\ {\it register} \text{ -- register-to-register copy} \\ {\it instruction}\ &::=&\ {\rm add}\ {\it register}\ < \!\!\!-\ {\it register}\ {\it register} \\ {\it instruction}\ &::=&\ {\rm sub}\ {\it register}\ < \!\!\!-\ {\it register}\ {\it register} \\ {\it instruction}\ &::=&\ {\rm mul}\ {\it register}\ < \!\!\!-\ {\it register}\ {\it register} \\ {\it instruction}\ &::=&\ {\rm div}\ {\it register}\ < \!\!\!-\ {\it register}\ {\it register} \\ \\ {\it instruction}\ &::=&\ {\rm jmp}\ {\it label} \text{ -- unconditional branch} \\ {\it instruction}\ &::=&\ {\rm bz}\ {\it register}\ {\it label} \text{ -- branch if equal to zero} \\ {\it instruction}\ &::=&\ {\rm bnz}\ {\it register}\ {\it label} \text{ -- branch if not zero} \\ {\it instruction}\ &::=&\ {\rm beq}\ {\it register}\ {\it register}\ {\it label} \text{ -- branch if equal} \\ {\it instruction}\ &::=&\ {\rm blt}\ {\it register}\ {\it register}\ {\it label} \text{ -- branch if less than} \\ {\it instruction}\ &::=&\ {\rm ble}\ {\it register}\ {\it register}\ {\it label} \text{ -- branch if less than or equal to} \\ {\it instruction}\ &::=&\ {\rm call}\ {\it label} \text{ -- direct function call} \\ {\it instruction}\ &::=&\ {\rm call}\ {\it register} \text{ -- register-indirect function call} \\ {\it instruction}\ &::=&\ {\rm return} \text{ -- function return} \\ \\ {\it instruction}\ &::=&\ {\rm push}\ {\it register} \text{ -- push a value on the stack} \\ {\it instruction}\ &::=&\ {\rm pop}\ {\it register} \text{ -- push a value off the stack} \\ {\it instruction}\ &::=&\ {\rm ld}\ {\it register}\ < \!\!\!-\ {\it register}\ {\rm [}\ {\it integer}\ {\rm ]} \text{ -- load a value from memory} \\ {\it instruction}\ &::=&\ {\rm st}\ {\it register}\ {\rm [}\ {\it integer}\ {\rm ]}\ < \!\!\!-\ {\it register} \text{ -- store a value into memory} \\ {\it instruction}\ &::=&\ {\rm la}\ {\it register}\ < \!\!\!-\ {\it label} \text{ -- load an address into a register} \\ \\ {\it instruction}\ &::=&\ {\rm alloc}\ {\it register}\ {\it register} \text{ -- allocate memory} \\ {\it instruction}\ &::=&\ {\rm constant}\ {\it integer} \text{ -- lay out a compile-time constant in memory} \\ {\it instruction}\ &::=&\ {\rm constant}\ {\it raw\_string} \text{ -- lay out a compile-time constant in memory} \\ {\it instruction}\ &::=&\ {\rm constant}\ {\it label} \text{ -- lay out a compile-time constant in memory} \\ \\ {\it instruction}\ &::=&\ {\rm syscall}\ {\it name} \text{ -- request a service from the run-time system} \\ \\ {\it instruction}\ &::=&\ {\rm debug}\ {\it register} \text{ -- debugging support: print register value} \\ {\it instruction}\ &::=&\ {\rm trace} \text{ -- toggle tracing} \\ \end{eqnarray} \]

That's it, and the last two do not really count. We next describe the interpretation of these instructions in more detail.

The system calls available are:

That system calls correspond directly to internal predefined methods on Cool Int and String objects. The key difference is that the system calls work on raw values (i.e., machine-level ints and strings) and not on Cool Objects.

Cool CPU Simulator

The normal Cool compiler executable (e.g., \(\rm cool.exe\)) also serves as a Cool CPU Simulator that executes Cool Assembly Language programs. Just pass \(\rm file.cl-asm\) as an argument.

The simulator performs the following actions:

  1. Load the \(\rm .cl-asm\) program into memory starting at address 1000. That is, if the first instruction in \(\rm file.cl-asm\) is \(\rm mov\ r1,\ r2\), then memory location 1000 will hold the instruction \(\rm mov\ r1,\ r2\). If the second instruction in \(\rm file.cl-asm\) is \(\rm constant\ 55\), then memory location 1001 will hold the integer 55.
  2. Set \(\rm sp\) and \(\rm fp\) to 2,000,000,000. Remember, the stack starts at high addresses and grows down.
  3. Search \(\rm file.cl-asm\) for a label named \(\rm start\). The program counter is set to the address corresponding to that label. For example, if \(\rm start:\) occurs just before the third instruction in \(\rm file.cl-asm\), then the program counter starts at 1002.
  4. Fetch the instruction pointed to by the program counter and execute it. Unless the instruction specifies otherwise, the program counter is incremented by one and the process repeats.
  5. When memory is allocated (e.g., by the \(\rm alloc\) instruction), addresses starting from at least 20,000 are used.
  6. If more than 1000 \(\rm call\) instructions are executed before any \(\rm return\) instructions are executed (i.e., if there are more than 1000 calls on the stack), the simulator terminates and prints a stack overflow error.

The constant values listed above (1000; 20,000; 2,000,000,000) should not be counted on by your program, but are listed here to help with debugging. Addresses near 1000 hold program instructions or compile-time data (i.e., the code segment), addresses near 20,000 hold the heap, and addresses near two billion are on the stack.

Debugging

Debugging assembly language programs is notoriously difficult! While writing your code generator, you will spend quite a bit of time running generated Cool Assembly programs through the Cool CPU Simulator to see if they work. Often they will not. The Cool CPU Simulator has been designed with a large number of features to aid debugging. Basically none of these features are present in traditional assemblers, so you actually have a wealth of debugging support, but it will still be difficult.

Control Flow Graphs

The Cool reference compiler also includes options to produce control flow graphic visualizations in the style of the dotty tool from the Graphviz toolkit.

Passing the \(\rm --cfg\) option (with, for example, \(\rm --opt\ --asm\)) produces \(\it method\)\(\rm .dot\), which can then be inspected via a number of tools. For example, this program:

class Main {
  main():Object {
    if (isvoid self) then  
      (new IO).out_string("cannot happen!\n")
    else 
      (new IO).out_string("hello, world!\n")
    fi 
  };
};

Might produce this control-flow graph:

While you do not have to match the reference compiler exactly, inspecting its control-flow graphs can help you debug your own code to create control-flow graphs.

Performance Model

As discussed above, the Cool reference compiler also includes a reference machine simulator to interpret Cool Assembly Language instructions. This simulator can be invoked directly by passing a \(\rm .cl \HY asm\) file to \(\rm cool.exe\):

cool$ cat hello-world.cl
class Main {
  main():Object {
    (new IO).out_string("hello, world!\n")
  };
};
cool$ ./cool --asm hello-world.cl
cool$ ./cool hello-world.cl-asm 
hello, world!

The simulator can also give detailed performance information:

cool$ ./cool --profile hello-world.cl-asm 
hello, world!
PROFILE:           instructions =        107 @    1 =>        107
PROFILE:        pushes and pops =         29 @    1 =>         29
PROFILE:             cache hits =         22 @    0 =>          0
PROFILE:           cache misses =        570 @  100 =>      57000
PROFILE:     branch predictions =          0 @    0 =>          0
PROFILE:  branch mispredictions =         11 @   20 =>        220
PROFILE:        multiplications =          0 @   10 =>          0
PROFILE:              divisions =          0 @   40 =>          0
PROFILE:           system calls =          2 @ 1000 =>       2000
CYCLES: 59356

The execution time of a Cool Assembly Language program is measured in simulated instruction cycles. In general, each assembly instruction takes one cycle. Some instructions, such as system calls or memory operation, can cost many more cycles. The total cycle cost of a program is the sum of all of its component cycle costs.

In modern architectures, memory hierarchy effects (e.g., caching) and branch prediction are dominant factors in the execution speed of a program. To give you a flavor for what real-world code optimization is like, the Cool Simulator also simulates a cache and a branch predictor.

The Cool Simulator features a 64-word least-recently-used fully associative combined instruction and data cache. It also uses a static backward = taken, forward = not taken branch prediction scheme.

We now discuss each of the performance components in turn:

  1. \(\bf instructions\). Each Cool Assembly Language instruction executed costs at least one cycle. This represents the time taken to fetch and decode the instruction, as well as to shepherd it through the pipeline. Instructions such as \(\rm li\), \(\rm mov\) and \(\rm add\) take one cycle.
  2. \(\bf pushes\ and\ pops\). Such \(\rm push\) and \(\rm pop\) involve both a load/store and also an add/sub, each costs an additional cycle (for a total of two). (\(\rm push\) and \(\rm pop\) can also incur cache miss penalties; see below.)
  3. \(\bf cache\ hits\ \&\ misses\). In modern computers, the CPU executes much faster than main memory: hundreds of "normal" instructions can be executed in the time it takes to fetch one value from memory. To mitigate this problem, a small number of values are placed in expensive, high-speed memory near the CPU. This small, fast memory stores recently-used values and is known as a \(\bf cache\). The Cool Simulator features a 64-word fully-associated cache: the values associated with 64 addresses can be accessed rapidly. If a memory read or write accesses an address that is in the cache, the instruction completes immediately with no extra cost. If a memory read or write accesses an address that is not in the cache, it costs 100 cycles while that value is read in from main memory. If there is no room in the cache to hold that new address's value, the address that has been touched (read or written) least recently is evicted and the new address/value is put in its place. Typical reasons for cache misses include compulsory, capacity and conflict. Note that the cache and cache miss penalty apply to every access to memory. This Includes:
  4. \(\bf branch\ prediction\ \&\ misprediction\). In a modern pipelined CPU, the next instruction is fetched before the current instruction has completed. This means that the CPU needs to know the address of the next instruction as early as possible. For a conditional branch, that may be difficult: the CPU may have to wait until the comparison is complete to determine if the next instruction will be at \(\rm pc+1\) or \(\rm label\). Modern CPUs optimistically "guess" or "predict" that a branch will go one way or the other and then rollback instructions if they are wrong. A correctly-predicted branch costs nothing; a mispredicted branch costs 20 cycles. The following instructions are related to this cost:
  5. \(\bf multiplication\ \&\ division\). Integer multiplication and division take longer on most architectures than addition and subtraction. In the Cool Simulator, \(\rm mul\) costs an extra 10 cycles and \(\rm div\) costs an extra 40.
  6. \(\bf system\ calls\). A system call involves trapping to the operating system, switching CPU protection contexts, putting the old process on the scheduling queue, handling the operation, rescheduling the new process, and switching CPU protection contexts again. System calls take forever. In the Cool Simulator, each \(\rm syscall\) instruction takes 1000 extra cycles.

This cost model involves realistic components but potentially unrealistic values (e.g., a modern CPU would have a much larger non-associative cache, and also a much larger cache miss cost). If you're interested in that sort of performance modeling, take a graduate class in computer architecture. You should know that this CPU performance model is one of the most realistic that I've seen for a compiler optimization project in terms of the issues that it forces you to address.

The reference compiler includes a simple reference peephole optimizer, as well as a few optimizations backed by dataflow analyses (liveness, reaching definitions, constant folding) and register allocation enabled via the \(\rm --opt\) flag. You can use it to get an idea for how to get started (but note that we are evil and strip all comments from the optimized output).

yuki:~/src/cool$ ./cool --opt --asm hello-world.cl
yuki:~/src/cool$ ./cool --profile hello-world.cl-asm 
hello, world!
PROFILE:           instructions =         79 @    1 =>         79
PROFILE:        pushes and pops =         23 @    1 =>         23
PROFILE:             cache hits =         15 @    0 =>          0
PROFILE:           cache misses =        513 @  100 =>      51300
PROFILE:     branch predictions =          2 @    0 =>          0
PROFILE:  branch mispredictions =          7 @   20 =>        140
PROFILE:        multiplications =          0 @   10 =>          0
PROFILE:              divisions =          0 @   40 =>          0
PROFILE:           system calls =          2 @ 1000 =>       2000
CYCLES: 53542

For the \(\rm hello-world\) program, this optimizer reduces the cycle cost from 59356 to 53453 -- a 10% improvement. If you are writing an optimizer, you will want to do at least as well as the reference, averaged over many input programs. Notably, you'll probably want to implement much more than the required dead code elimination optimization.

Previous Up Next